Abstract

The probabilistic-stream model was introduced by Jayram et al. [16]. It is a generalization
of the data stream model that is suited to handling "probabilistic" data where each item of 
the stream represents a probability distribution over a set of possible events. Therefore, a 
probabilistic stream determines a distribution over potentially a very large number of classical 
"deterministic" streams where each item is deterministically one of the domain values. The 
probabilistic model is applicable for not only analyzing streams where the input has uncertainties 
(such as sensor data streams that measure physical processes) but also where the streams are 
derived from the input data by post-processing, such as tagging or reconciling inconsistent and 
poor quality data. 

We present streaming algorithms for computing commonly used aggregates on a probabilistic
stream. We present the first known one-pass streaming algorithm for estimating AVG,
improving on the results in [16]. We present the first known streaming algorithms for estimating the
number of DISTINCT items on probabilistic streams. Further, we present extensions to other
aggregates such as the repeat rate, quantiles, etc. In all cases, our algorithms work with provable 
accuracy guarantees and within the space constraints of the data stream model. 

1 Introduction 

In order to deal with massive data sets that arrive online and have to be monitored, managed and 
mined in real time, the data stream model has become popular. In a stream A, the item $a_t$ that
arrives at time t is from some universe $[n] := \{1, \ldots, n\}$. Applications need to monitor various
aggregate queries such as DISTINCT, MEDIAN, HEAVY-HITTER, etc. in small space, typically in
poly(log n) where poly is some polynomial. Data stream management systems are becoming mature 
at managing such streams and monitoring these aggregates on high speed streams such as in IP 
traffic analysis [9], financial and scientific data streams [3], and others. See [2, 19] for surveys of
systems and algorithms for data stream management. 

We go beyond this classic data stream model and consider a recently introduced generalization 
called the probabilistic stream model [16] . The central premise here is that each item in the stream 
is not simply one from the universe, but represents a likelihood of a set of items from the universe. 
Thus, the input is "probabilistic" where each item on the stream represents a probability distribu- 
tion over a set of possible events. For example, consider the stream of search queries posed by users 
that arrives at a search engine. Based on analyzing each such query, say for example "Stunning
Jaguar", one may be able to consider the probabilities associated with the different topics of interest
to the user, such as "Car, Pr(Car)=0.2", "Animal, Pr(Animal)=0.7" and "Movie, Pr(Movie)=0.1".
Such a probabilistic stream can be thought of as a probability distribution over potentially a very 
large number of classical "deterministic" streams we described first. In monitoring such a stream, 
the aggregates we will now be interested in are the expected values of the aggregates over the
probabilistic streams. For example, we may be interested in the expected number of queries to the topic
"Car" in the search query stream above. 

The probabilistic stream model is a natural generalization of the classical data stream model 
and applies in a variety of ways. There are instances when the input data is probabilistic by its 
nature: for example, when we measure physical quantities such as light luminosity, temperature 
and others, we actually measure an average of these quantities over many tiny time instants in the 
analog world to get a single digital reading. There are instances where the input data is uncertain 
such as due to the calibration of measurement devices and the noise in measurements. There are 
also instances when the stream is derived from the input and this introduces the probabilities. For 
example, the original motivation for the probabilistic stream model [16] was studying uncertainties
in data associated with data cleaning and reconciliation that resulted in probabilistic databases. 
There are also other instances when the derived data stream is probabilistic. The motivating 
example we provided above is such an instance in which the input is a stream of web search queries
and the derived stream is a stream of probability distributions over topics of queries rather than 
the search terms. 

Previous work on probabilistic streams has included data stream algorithms for certain aggre- 
gate queries [U [16]. In this paper, we improve those results and study additional aggregate 
functions which are quite fundamental in any stream monitoring. In what follows, we will first 
describe the model precisely and then state our results. 



1.1 The Probabilistic Stream Model and Our Problems 



Definition 1 (Probabilistic Stream). A probabilistic stream is a data stream $A = (a_1, a_2, \ldots, a_m)$
in which each data item $a_i$ encodes a random variable that takes a value in $[n] \cup \{\bot\}$. In particular
each $a_i$ consists of a set of at most $\ell$ tuples of the form $(j, p^i_j)$ for some $j \in [n]$ and $p^i_j \in [0, 1]$.
These tuples define the random variable $X_i$ where $X_i = j$ with probability $p^i_j$ and $X_i = \bot$ otherwise.
We define $p^i_\bot = \Pr[X_i = \bot]$ and $p^i_j = 0$ if not otherwise determined.

A probabilistic stream naturally gives rise to a distribution over "deterministic" streams. Specifically,
we consider the ith element of the stream to be determined according to the random variable
$X_i$ and each element of the stream to be determined independently. Hence,

$$\Pr[(x_1, x_2, \ldots, x_m)] = \prod_{i \in [m]} \Pr[X_i = x_i] = \prod_{i \in [m]} p^i_{x_i} .$$
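As a small illustration of this product formula, the following Python sketch enumerates every deterministic stream induced by a tiny probabilistic stream together with its probability. The list-of-tuples representation, the use of `None` for $\bot$, and the helper names `outcomes` and `deterministic_streams` are assumptions made for this example only. Enumeration takes exponential time, so it is useful only for checking small cases.

```python
from itertools import product

BOTTOM = None  # stands for the "no item" outcome (the symbol ⊥ in the text)

def outcomes(item):
    """Return [(value, probability)] for one stream item, adding the ⊥ outcome."""
    p_bottom = 1.0 - sum(p for _, p in item)
    return list(item) + [(BOTTOM, p_bottom)]

def deterministic_streams(stream):
    """Yield ((x_1, ..., x_m), probability) for every deterministic stream."""
    for combo in product(*(outcomes(item) for item in stream)):
        prob = 1.0
        for _, p in combo:
            prob *= p  # items are determined independently
        yield tuple(v for v, _ in combo), prob

# Hypothetical stream with m = 2 items over the universe [3].
stream = [[(1, 0.5), (3, 0.2)], [(2, 0.9)]]
worlds = dict(deterministic_streams(stream))
```

Here `worlds[(1, 2)]` is the probability that the first item is 1 and the second is 2, that is, 0.5 times 0.9.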



We will be interested in the expected value of various quantities of the deterministic stream 
induced by a probabilistic stream. 

Definition 2. We are interested in the following aggregate properties: 



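1. $\mathrm{SUM} = E\left[ \sum_{i \in [m] : X_i \ne \bot} X_i \right]$

2. $\mathrm{COUNT} = E\left[ |\{i \in [m] : X_i \ne \bot\}| \right]$

3. $\mathrm{AVG} = E\left[ \dfrac{\sum_{i \in [m] : X_i \ne \bot} X_i}{|\{i \in [m] : X_i \ne \bot\}|} \right]$

4. $\mathrm{DISTINCT} = E\left[ |\{j \in [n] : \exists i \in [m], X_i = j\}| \right]$

5. $\mathrm{MEDIAN} = x$ such that

$$E[|\{i \in [m] : X_i < x\}|] \le \tfrac{1}{2} \lceil \mathrm{COUNT} \rceil \quad\text{and}\quad E[|\{i \in [m] : X_i > x\}|] \le \tfrac{1}{2} \lceil \mathrm{COUNT} \rceil$$

6. $\mathrm{REPEAT\text{-}RATE} = E\left[ \sum_{j \in [n]} |\{i \in [m] : X_i = j\}|^2 \right]$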

These are fundamental aggregates for stream monitoring, and well-studied in the classical stream 
model. For example, REPEAT-RATE in that case is the $F_2$ estimation problem studied in the
seminal [1]. DISTINCT and MEDIAN have a long history in the classical data stream model [18, 12].
In the probabilistic stream model, we do not know of any prior algorithm with guaranteed bounds 
for these aggregates except the AVG (SUM and COUNT are trivial) which, as we discuss later, we 
will improve. 

As in classical data stream algorithms, our algorithms need to use polylog(n) space and time
per item, on a single pass over the data. It is realistic to assume that $\ell = O(\mathrm{polylog}\, n)$ so that
processing each item in the stream is still within these space and time bounds. In most motivating
instances, $\ell$ is likely to be quite small, say $O(1)$. Also, since the deterministic stream is an instance
of the probabilistic stream (with each item in the probabilistic stream being one deterministic
item with probability 1), known lower bounds for the deterministic stream model carry over to
the probabilistic stream model. As a result, for most estimation problems, there do not exist
streaming algorithms that precisely calculate the aggregate. Instead, we will focus on returning
approximations to the quantities above. We say a value $\hat{Q}$ is an $(\epsilon, \delta)$-approximation for a real
number $Q$ if $|\hat{Q} - Q| \le \epsilon Q$ with probability at least $1 - \delta$ over the algorithm's internal coin tosses. Many
of our algorithms will be deterministic and return $(\epsilon, 0)$-approximations. When approximating
MEDIAN it makes more sense to consider a slightly different notion of approximation. We say $x$ is
an $\epsilon$-approximate median if

$$E[|\{i \in [m] : X_i < x\}|] \le (1/2 + \epsilon) \lceil \mathrm{COUNT} \rceil \quad\text{and}\quad E[|\{i \in [m] : X_i > x\}|] \le (1/2 + \epsilon) \lceil \mathrm{COUNT} \rceil .$$



Related Models: The probabilistic stream model complements another stream model that has
been recently considered where the stream consists of independent samples drawn from a single
unknown distribution [8, 14]. In this alternative model, the probabilistic stream $A = (a_1, \ldots, a_1)$
consists of a repeated element encoding a single probability distribution over $[n]$, but the crux is that
the algorithm does not have access to the probabilistic stream. The challenge is to infer properties
of the probability distribution $a_1$ from a randomly chosen deterministic stream. There is related
work on reconstructing strings from random traces [17]. Here, each element of the probabilistic
stream is of the form $\{(i, 1 - p), (\bot, p)\}$ for some $i \in [n]$. As before, the algorithm does not have
access to the probabilistic stream but rather tries to infer the probabilistic stream from a limited
number of independently generated deterministic streams. The results in [14, 8, 17] do not
provide any bounds for estimating aggregates such as the ones we study here when the input is drawn
from multiple, known probability distributions.



1.2 Our Results 

We present the first single-pass algorithms for estimating the aggregate properties AVG, MEDIAN,
REPEAT-RATE, and DISTINCT. The algorithms for AVG and MEDIAN are deterministic while the
other two algorithms are randomized. While it is desirable for all the algorithms to be deterministic,
this randomization can be shown to be necessary using standard results in streaming algorithms,
e.g., those of Alon, Matias, and Szegedy [1, Proposition 3.7].

Throughout this paper we assume each probability specified in the probabilistic stream is either
0 or $\Omega(1/\mathrm{poly}(n))$. The space complexities of our algorithms are as follows.

• We present a single pass, deterministic, $(\epsilon, 0)$-approximation for AVG using $O(\epsilon^{-1} \log(nm/\epsilon))$
space. Further, if the COUNT of the stream is sufficiently high then our algorithm needs only
$O(1)$ words of space.

The best known previous work [16] presents an $O(\log n)$ pass algorithm using $O(\epsilon^{-1} \log^2 n)$
space. Thus our algorithm substantially improves the number of passes needed and indeed
works in the one-pass constraint of a streaming model.

• We present the first known streaming algorithms for other aggregates listed earlier. The rest 
of our results are summarized in the table below. 

Aggregate      Space                                                      Randomized/Deterministic
DISTINCT       $O(\epsilon^{-5} \log n \log^2 \delta^{-1})$               Randomized
REPEAT-RATE    $O(\epsilon^{-2} (\log n + \log m) \log \delta^{-1})$     Randomized
MEDIAN         $O(\epsilon^{-2} \log m)$                                  Deterministic

Aggregates such as AVG, SUM and COUNT appear, at first glance, to be trivial to estimate.
That is the case with deterministic streams. For probabilistic streams, SUM and COUNT can still
be estimated easily since they are linear in the input and hence we can write straightforward
formulas to compute them. However, AVG is not simply SUM/COUNT, as shown in [16], and needs
nontrivial techniques. The central difficulty is that it is nonlinear and hence the principle of
"linearity of expectation" is not directly useful. The other aggregates such as DISTINCT, MEDIAN and
REPEAT-RATE have previously known algorithms for deterministic streams. A natural approach
therefore would be to randomly instantiate multiple independent deterministic streams, apply
standard stream algorithms and then apply a robust estimator. This approach works to an extent but
typically gives poor approximations to small quantities. This is the case for DISTINCT. Furthermore,
this approach necessitates a high degree of randomization. For MEDIAN and REPEAT-RATE,
we show that it is possible to deterministically instantiate a single deterministic stream and run
standard stream algorithms.

2 Average, Sum, and Count 

For the duration of this section let Y and Z be the random variables defined by,

$$Y = \sum_{i \in [m] : X_i \ne \bot} X_i \quad\text{and}\quad Z = |\{i \in [m] : X_i \ne \bot\}| .$$

In this section we are interested in estimating $\mathrm{AVG} = E[Y/Z]$. It was shown in [16] that $E[Y]/E[Z]$
can be an arbitrarily poor approximation for AVG.
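To make the gap between $E[Y]/E[Z]$ and $E[Y/Z]$ concrete, the following Python sketch computes SUM, COUNT and AVG by brute-force enumeration on a two-item stream. The stream, the function name `expected_aggregates`, and the convention that an outcome with $Z = 0$ contributes 0 to AVG are assumptions of this example, not taken from [16].

```python
from itertools import product

def expected_aggregates(stream):
    """Brute-force SUM, COUNT and AVG by enumerating every deterministic stream."""
    per_item = []
    for item in stream:
        p_bottom = 1.0 - sum(p for _, p in item)
        per_item.append(list(item) + [(None, p_bottom)])  # None plays the role of ⊥
    total_sum = total_count = total_avg = 0.0
    for combo in product(*per_item):
        prob = 1.0
        for _, p in combo:
            prob *= p
        values = [v for v, _ in combo if v is not None]
        y, z = sum(values), len(values)
        total_sum += prob * y
        total_count += prob * z
        if z > 0:  # assumption: an empty deterministic stream contributes 0
            total_avg += prob * y / z
    return total_sum, total_count, total_avg

# Hypothetical stream: the value 1 always appears, the value 100 appears rarely.
stream = [[(1, 1.0)], [(100, 0.01)]]
s, c, avg = expected_aggregates(stream)
```

On this stream SUM = 2 and COUNT = 1.01, so SUM/COUNT is about 1.98, whereas AVG = 0.99 · 1 + 0.01 · (101/2) = 1.495.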

We consider two different cases. We show that if the stream has a sufficiently long expected length,
that is, COUNT is large, then $E[Y]/E[Z]$ is a guaranteed approximation for AVG. This will
follow because Z will be tightly concentrated around $E[Z]$. Subsequently we show AVG can be
approximated for streams with shorter expected length. This needs a different idea: if COUNT is
not large then it is sufficient to estimate AVG from $\Pr[Z = z]$ and $E[Y \mid Z = z]$ for a relatively small
range of values of z. When combined, this gives us a single pass algorithm improving upon the
algorithm presented in [16] that uses $O(\log n)$ passes.



2.1 Streams with long expected length 
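Lemma 1. If $\mathrm{COUNT} \ge 12\epsilon^{-2} \ln(10nm\epsilon^{-1})$ then,

$$\left| \frac{\mathrm{SUM}}{\mathrm{COUNT}} - \mathrm{AVG} \right| \le \epsilon\, \mathrm{AVG} .$$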
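Proof. Let A be the event that $|Z - \mathrm{COUNT}| > \epsilon\, \mathrm{COUNT}/2$. Assume that $\mathrm{COUNT} \ge 12\epsilon^{-2} \ln(2\delta^{-1})$
and so, by an application of the Chernoff-Hoeffding bounds,

$$\Pr[A] \le 2\exp(-\epsilon^2\, \mathrm{COUNT}/12) \le \delta .$$

Set $\delta = \epsilon/(5nm)$. Then,

$$
\begin{aligned}
\left| \frac{\mathrm{SUM}}{\mathrm{COUNT}} - \mathrm{AVG} \right|
&= \left| \frac{E[Y]}{E[Z]} - \sum_{y,z} \frac{y}{z} \Pr[Y = y, Z = z, A] - \sum_{y,z} \frac{y}{z} \Pr[Y = y, Z = z, \neg A] \right| \\
&\le \left| \frac{E[Y]}{E[Z]} - \sum_{y,z} \frac{y}{z} \Pr[Y = y, Z = z, \neg A] \right| + \sum_{y,z} \frac{y}{z} \Pr[Y = y, Z = z, A] \\
&\le \left| \frac{E[Y]}{E[Z]} - \frac{1}{E[Z]} \sum_{y} y \Pr[Y = y, \neg A] \right| + \frac{\epsilon}{2E[Z]} \sum_{y} y \Pr[Y = y, \neg A] + n \Pr[A] \\
&\le \frac{1}{E[Z]} \sum_{y} y \Pr[Y = y, A] + \frac{\epsilon E[Y]}{2E[Z]} + n\delta \\
&\le \frac{\epsilon E[Y]}{2E[Z]} + nm\delta + n\delta \\
&\le \frac{9\epsilon}{10} \frac{E[Y]}{E[Z]}
\end{aligned}
$$

where the last line follows since $E[Y] \ge E[Z]$. Consequently, for sufficiently small $\epsilon$, we deduce
that $\left| \frac{\mathrm{SUM}}{\mathrm{COUNT}} - \mathrm{AVG} \right| \le \epsilon\, \mathrm{AVG}$ as required. □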

COUNT and SUM can be trivially computed exactly in a single pass because $\mathrm{COUNT} = \sum_{i \in [m]} (1 - p^i_\bot)$
and $\mathrm{SUM} = \sum_{i \in [m]} E[X_i \mid X_i \ne \bot]\,(1 - p^i_\bot)$. This leads to the following theorem.

Theorem 1. If $\mathrm{COUNT} \ge 12\epsilon^{-2} \ln(10nm\epsilon^{-1})$ then we can deterministically estimate AVG in a
single pass using $O(\log n)$ space.
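The single-pass computation behind Theorem 1 can be sketched in Python as follows. The function name `sum_count_ratio` and the list-of-tuples item format are assumptions of this example, and the code does not check the condition on COUNT, which the caller is assumed to verify separately.

```python
def sum_count_ratio(stream):
    """One pass: accumulate SUM and COUNT exactly and return SUM/COUNT.

    Each item is a list of (j, p) tuples; the missing probability mass is ⊥.
    """
    total_sum = 0.0
    total_count = 0.0
    for item in stream:
        # sum of p over the tuples equals 1 - p_bottom for this item
        total_count += sum(p for _, p in item)
        # sum of j * p equals E[X_i | X_i != ⊥] * (1 - p_bottom)
        total_sum += sum(j * p for j, p in item)
    ratio = total_sum / total_count if total_count > 0 else 0.0
    return total_sum, total_count, ratio

# Works on any iterable, including a generator that is consumed once.
total_sum, total_count, ratio = sum_count_ratio(iter([[(1, 1.0)], [(100, 0.01)]]))
```

Only two running totals are stored, which matches the constant number of words mentioned for the large-COUNT case.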



2.2 The General Case 

In this section we will remove the assumption that $\mathrm{COUNT} \ge 12\epsilon^{-2} \ln(10nm\epsilon^{-1})$ but our algorithm
will require more space. The main idea behind our algorithm is that if COUNT is not large then it
is sufficient to estimate AVG from $\Pr[Z = z]$ and $E[Y \mid Z = z]$ for a relatively small range of values
of z.

The next lemma shows that it is possible to compute the probability that Z takes various values
if we assume that at each point during the evolution of the stream, the number of elements that
have appeared does not differ significantly from its expected number.

Lemma 2. For $j \in [m]$, let $Z_j = |\{i \in [j] : X_i \ne \bot\}|$ and $Y_j = \sum_{i \le j : X_i \ne \bot} X_i$. Let $C^w_j$ be the event
that,

$$\forall i \in [j], \quad |Z_i - E[Z_i]| \le w .$$

It is possible to compute $A_z = \Pr[Z = z, C^w_m]$ and $B_z = \sum_y y \Pr[Y = y, Z = z, C^w_m]$ for all z in a
single pass using $O(w \log n)$ bits of space.



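Proof. Let $A_{j,z} = \Pr[Z_j = z, C^w_j]$ and $B_{j,z} = \sum_y y \Pr[Y_j = y, Z_j = z, C^w_j]$. First note that for
$j, z \in [m]$,

$$
\begin{aligned}
A_{j,z} &= \Pr[Z_{j-1} = z, X_j = \bot, C^w_j] + \Pr[Z_{j-1} = z - 1, X_j \ne \bot, C^w_j] \\
&= \begin{cases} A_{j-1,z-1}(1 - p^j_\bot) + A_{j-1,z}\, p^j_\bot & \text{if } |z - E[Z_j]| \le w \\ 0 & \text{otherwise} \end{cases}
\end{aligned}
$$

where $A_{0,z} = 1$ if $z = 0$ and 0 otherwise. Also, on the assumption that $|z - E[Z_j]| \le w$,

$$
\begin{aligned}
B_{j,z} &= \sum_y y \Big( \Pr[Y_{j-1} = y, Z_{j-1} = z, X_j = \bot, C^w_j] + \sum_{a \in [n]} \Pr[Y_{j-1} = y - a, Z_{j-1} = z - 1, X_j = a, C^w_j] \Big) \\
&= p^j_\bot B_{j-1,z} + (1 - p^j_\bot) \sum_a \Pr[X_j = a \mid X_j \ne \bot] \sum_y y \Pr[Y_{j-1} = y - a, Z_{j-1} = z - 1, C^w_{j-1}] \\
&= p^j_\bot B_{j-1,z} + (1 - p^j_\bot) \sum_a \Pr[X_j = a \mid X_j \ne \bot] \big( a \Pr[Z_{j-1} = z - 1, C^w_{j-1}] + B_{j-1,z-1} \big) \\
&= p^j_\bot B_{j-1,z} + (1 - p^j_\bot) \big( E[X_j \mid X_j \ne \bot]\, A_{j-1,z-1} + B_{j-1,z-1} \big)
\end{aligned}
$$

and $B_{0,z} = 0$ for all z. If $|z - E[Z_j]| > w$ then clearly $B_{j,z} = 0$. Lastly $E[Z_j] = E[Z_{j-1}] + (1 - p^j_\bot)$.
Hence we can compute $(A_{j,z})_{0 \le z \le m}$, $(B_{j,z})_{0 \le z \le m}$ and $E[Z_j]$ from $(A_{j-1,z})_{0 \le z \le m}$, $(B_{j-1,z})_{0 \le z \le m}$
and $E[Z_{j-1}]$. The space bound follows because at most $O(w)$ of $(A_{j,z})_{0 \le z \le m}$ and $(B_{j,z})_{0 \le z \le m}$ are
non-zero for any j. □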
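The recurrences in this proof translate directly into a one-pass dynamic program. The Python sketch below is one possible rendering under some assumptions: the function name `windowed_avg` is hypothetical, dictionaries hold only the entries inside the window $|z - E[Z_j]| \le w$, floating-point arithmetic replaces exact arithmetic, and the term for $z = 0$ is dropped when forming $\sum_z B_z / z$.

```python
def windowed_avg(stream, w):
    """One pass over `stream`; returns sum over z >= 1 of B_z / z on the event C^w_m."""
    A = {0: 1.0}  # A[z] = Pr[Z_j = z, C^w_j]
    B = {0: 0.0}  # B[z] = sum_y y * Pr[Y_j = y, Z_j = z, C^w_j]
    expected_z = 0.0  # E[Z_j]
    for item in stream:
        p_present = sum(p for _, p in item)     # 1 - p_bottom
        p_bottom = 1.0 - p_present
        weighted = sum(j * p for j, p in item)  # (1 - p_bottom) * E[X_j | X_j != ⊥]
        expected_z += p_present
        new_A, new_B = {}, {}
        for z in set(A) | {k + 1 for k in A}:
            if abs(z - expected_z) > w:
                continue  # outside the window both entries are zero
            a = A.get(z, 0.0) * p_bottom + A.get(z - 1, 0.0) * p_present
            b = (B.get(z, 0.0) * p_bottom
                 + weighted * A.get(z - 1, 0.0)
                 + p_present * B.get(z - 1, 0.0))
            if a > 0.0:
                new_A[z], new_B[z] = a, b
        A, B = new_A, new_B
    return sum(B[z] / z for z in B if z > 0)

stream = [[(1, 1.0)], [(100, 0.01)]]
wide = windowed_avg(stream, w=2)      # window covers every reachable z
narrow = windowed_avg(stream, w=0.5)  # window excludes the outcome Z = 2
```

With the wide window the result equals AVG for this stream (1.495); with the narrow window the contribution of the excluded outcomes is lost and the result is 0.99.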

Hence, if we choose w to be large enough such that $C^w_m$ is a sufficiently high probability event
or that $\mathrm{COUNT} \ge 12\epsilon^{-2} \ln(10nm\epsilon^{-1})$, then we can use the above lemma to obtain a deterministic
algorithm for estimating AVG. We show:

Theorem 2. AVG can be computed to $(\epsilon, 0)$ approximation in a single pass using
$O(\epsilon^{-1}(\log n + \log m + \log \epsilon^{-1}) \log n)$ bits of space.
Proof. Let $c = 12\epsilon^{-2} \ln(10nm\epsilon^{-1})$. First, if $\mathrm{COUNT} \ge c$ then by Theorem 1, SUM/COUNT is an
$(\epsilon, 0)$ approximation of AVG. Alternatively assume that $\mathrm{COUNT} < c$. Let $w = \epsilon c$. By an application
of the Chernoff-Hoeffding and union bounds,

$$\Pr[C^w_m] \ge 1 - \sum_{i \in [m]} \Pr[|Z_i - E[Z_i]| \ge w] \ge 1 - 2m \exp(-\epsilon^2 c/3) \ge 1 - \epsilon/(4n) .$$

$\sum_z \frac{1}{z} \sum_y y \Pr[Y = y, Z = z, C^w_m]$ can be computed in $O(w \log n)$ space as described in Lemma 2.
But then,

$$
\begin{aligned}
\left| \sum_z \frac{1}{z} \sum_y y \Pr[Y = y, Z = z, C^w_m] - \mathrm{AVG} \right|
&= \sum_{y,z} \frac{y}{z} \left| \Pr[Y = y, Z = z, C^w_m] - \Pr[Y = y, Z = z] \right| \\
&\le n \Pr[\neg C^w_m] \\
&\le \epsilon .
\end{aligned}
$$

This translates into an $(\epsilon, 0)$ approximation since $\mathrm{AVG} \ge 1$. □



Precision Issues: One remaining issue that needs to be addressed is the precision to which
the various quantities are calculated. In Lemma 2 it was assumed that all the arithmetic could
be performed exactly. However, it is very likely that, for some z, j both $A_{j,z}$ and $B_{j,z}$ can be
exponentially small and yet non-zero. We can easily show that errors introduced by a lack of
precision do not accumulate sufficiently to cause a problem. In particular, let us assume that
the calculation of each $A_{j,z}$ is performed with the addition of error $\beta$. Then, given that
$A_{j,z} = A_{j-1,z-1}(1 - p^j_\bot) + A_{j-1,z}\, p^j_\bot$, if it is non-zero, a simple induction argument shows that $A_{j,z}$ is
computed with error at most

$$(p^j_\bot + 1 - p^j_\bot)(j - 1)\beta + \beta = j\beta .$$

Similarly, $B_{j,z}$ is computed with error at most,

$$p^j_\bot\, n (j - 1)^2 \beta + (1 - p^j_\bot)\big( n(j - 1)\beta + n(j - 1)^2 \beta \big) + \beta \le j^2 n \beta .$$

Therefore $\beta = \epsilon m^{-2} n^{-2}$ suffices to compute $\sum_z \frac{1}{z} \sum_y y \Pr[Y = y, Z = z, C^w_m]$ with additive error at
most $\epsilon$, which translates into a relative error since the quantity being estimated is $\Omega(1)$.



3 Distinct Items 

In this section we present an $(\epsilon, \delta)$-approximation algorithm for DISTINCT. In contrast to the
algorithm for AVG, a major part of our algorithm is to actually randomly instantiate numerous
deterministic streams and to compute the average value of the number of distinct values in these
streams. However, this general approach will only give an $(\epsilon, \delta)$ approximation in small space if
the expected number of distinct values is not very small. In the following theorem we show that it
is possible to also deal with this case. Furthermore, the random instantiations will be performed
in a slightly non-obvious fashion for the purpose of achieving a tight analysis of the appropriate
probability bounds.

Theorem 3. We can $(\epsilon, \delta)$-approximate DISTINCT in one pass and $O(\epsilon^{-5} \log n \log^2 \delta^{-1})$ space.
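Proof. First note that,

$$\mathrm{DISTINCT} = \sum_{j \in [n]} \Pr[\exists i \in [m], X_i = j] = \sum_{j \in [n]} \Big( 1 - \prod_{i \in [m]} (1 - p^i_j) \Big) .$$

Consider $\mathrm{COUNT} = \sum_{i \in [m]} (1 - p^i_\bot)$. Then,

$$e^{-\mathrm{COUNT}}\, \mathrm{COUNT} \le 1 - e^{-\mathrm{COUNT}} \le 1 - \prod_{i \in [m], j \in [n]} (1 - p^i_j) \le \mathrm{DISTINCT} \le \sum_{i \in [m], j \in [n]} p^i_j = \mathrm{COUNT} .$$

Hence if $\mathrm{COUNT} \le \ln(1 + \epsilon)$ then COUNT is an $(\epsilon, 0)$ approximation for DISTINCT. We now assume
that $\mathrm{COUNT} > \ln(1 + \epsilon)$ and hence

$$\mathrm{DISTINCT} \ge \ln(1 + \epsilon)\, e^{-\ln(1 + \epsilon)} \ge \epsilon/2$$

assuming $\epsilon$ is sufficiently small.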

Consider the following basic estimator X,

1. For each tuple $(j, p^i_j)$ in $a_i$, put j in an induced stream A' with probability $p^i_j$.

2. Compute an $(\epsilon/3, \delta/(2c_1))$ approximation X of $F_0(A')$ using [4].

Note that the stream A' is not generated according to the distribution defined by the
probabilistic stream because more than one item can be generated for a given $a_i$. However, the expected
number of distinct elements in A' is the same as the expected number of distinct elements in a
stream generated by the probabilistic stream A. In particular $E[X] = (1 \pm \epsilon/3)\, \mathrm{DISTINCT}$. The
reason for generating A' in this way is that we may argue that the events that i and j appear in
A' are independent. This will be important in our analysis.

We will compute $c_1 = 3^3 \cdot 2\epsilon^{-3} \ln(4/\delta)$ of these basic estimators and average them to produce
our final estimator Y. The accuracy and probability of success of our algorithm follows from the
following claim.

Claim 3. $\Pr[|Y - \mathrm{DISTINCT}| \ge \epsilon\, \mathrm{DISTINCT}] \le \delta$

Proof. First we effectively assume that $F_0(A')$ can be computed exactly. Let Z be the random
variable corresponding to the average of $c_1$ copies of $F_0(A')$. $E[Z] = \mathrm{DISTINCT}$. Note that $c_1 Z$ can
be thought of as the sum of $n c_1$ independent boolean trials: $c_1$ trials corresponding to each $j \in [n]$
(some trials may have zero probability of success) and hence,

$$
\begin{aligned}
\Pr[|Z - \mathrm{DISTINCT}| \ge (\epsilon/3)\, \mathrm{DISTINCT}] &= \Pr[|c_1 Z - c_1 \mathrm{DISTINCT}| \ge (\epsilon/3)\, c_1 \mathrm{DISTINCT}] \\
&\le 2\exp(-\epsilon^2 c_1 \mathrm{DISTINCT}/27) \\
&\le 2\exp(-\epsilon^3 c_1/(2 \cdot 27)) \\
&\le \delta/2 .
\end{aligned}
$$

With probability at least $1 - c_1 \delta/(2c_1) = 1 - \delta/2$ each of the $c_1$ calls to the distinct element
counter returns an $\epsilon/3$ approximation. Hence with probability at least $1 - \delta/2$, $Y = (1 \pm \epsilon/3) Z$.
Therefore with probability at least $1 - \delta$,

$$Y = (1 \pm \epsilon/3) Z = (1 \pm \epsilon/3)^2\, \mathrm{DISTINCT} = (1 \pm \epsilon)\, \mathrm{DISTINCT}$$

assuming $\epsilon \le 1/4$. □
The space bound follows because for each of our $O(\epsilon^{-3} \log \delta^{-1})$ basic estimators, the distinct
element counter requires $O((\epsilon^{-2} + \log n) \log \delta^{-1})$ space [3]. □
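The structure of this estimator can be sketched in Python as follows. Several simplifications are assumptions of this example and not part of the proof: each induced stream keeps an exact set of seen values in place of a small-space $F_0$ sketch, the number of copies is passed in directly instead of being derived from $c_1$, and `exact_distinct` uses $O(n)$ space and is included only as a reference value. The function and parameter names are hypothetical.

```python
import math
import random

def exact_distinct(stream, n):
    """Reference value: sum_j (1 - prod_i (1 - p_j^i)); uses O(n) space."""
    miss = [1.0] * (n + 1)  # miss[j] = probability that j never appears
    for item in stream:
        for j, p in item:
            miss[j] *= 1.0 - p
    return sum(1.0 - q for q in miss[1:])

def estimate_distinct(stream, eps, copies, rng):
    """Average the number of distinct values over `copies` induced streams A'."""
    count = 0.0
    seen = [set() for _ in range(copies)]  # exact stand-in for an F_0 sketch
    for item in stream:
        for j, p in item:
            count += p
            for s in seen:
                if rng.random() < p:  # every tuple is sampled independently
                    s.add(j)
    if count <= math.log(1.0 + eps):
        return count  # small-COUNT case: COUNT itself is the estimate
    return sum(len(s) for s in seen) / copies

# Hypothetical stream: 40 items over [20], each value offered twice with p = 0.5.
stream = [[(i % 20 + 1, 0.5)] for i in range(40)]
exact = exact_distinct(stream, 20)
estimate = estimate_distinct(stream, eps=0.1, copies=200, rng=random.Random(0))
```

For this stream each value appears with probability 1 - 0.25 = 0.75, so the exact value is 15, and the averaged estimate should fall close to it.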

4 Other Aggregates 

In this section, we present streaming algorithms for other aggregates such as REPEAT-RATE and 
MEDIAN. In both cases, the approach is to reduce the problem to ones over the deterministic 
streams, and the algorithms are pleasantly simple. 

Estimating REPEAT-RATE: We present an $(\epsilon, \delta)$-approximation for REPEAT-RATE. The following
result is based on reducing the problem to estimating $F_2$ in a deterministic stream. In contrast
to the result on estimating DISTINCT, this reduction can be done deterministically. We then use
an algorithm by Alon, Matias, and Szegedy [1] to estimate the second frequency moment of the
deterministic stream.

Theorem 4. There exists a single pass algorithm that $(\epsilon, \delta)$-approximates REPEAT-RATE using
$O(\epsilon^{-2} (\log m + \log n) \log \delta^{-1})$ bits of space.

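Proof. We start by re-writing REPEAT-RATE as follows.

$$
\begin{aligned}
\mathrm{REPEAT\text{-}RATE} &= \sum_{j \in [n]} E\big[ |\{i : X_i = j\}|^2 \big] \\
&= \sum_{j \in [n]} \Big( \Big( \sum_{i \in [m]} p^i_j \Big)^2 + \sum_{i \in [m]} p^i_j (1 - p^i_j) \Big) \\
&= \sum_{j \in [n]} \Big( \sum_{i \in [m]} p^i_j \Big)^2 + \sum_{i \in [m], j \in [n]} p^i_j (1 - p^i_j)
\end{aligned}
$$

Note that this first term can be $(\epsilon, \delta)$-approximated using $O(\epsilon^{-2} \log \delta^{-1} (\log m + \log n))$ space using
an algorithm of Alon, Matias, and Szegedy [1]. The second term can be computed exactly in
$O(\log m)$ space. Since both terms are positive we get an $(\epsilon, \delta)$-approximation to REPEAT-RATE in
the space claimed. □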
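The decomposition used in this proof can be checked numerically on a small stream. In the Python sketch below, the first term is computed exactly with a dictionary of per-value sums, which is a stand-in for the small-space sketch of [1] and an assumption of this example; the function names are hypothetical, and each item is assumed to list any value j at most once.

```python
from itertools import product

def repeat_rate(stream):
    """One pass: sum_j (sum_i p_j^i)^2 + sum_{i,j} p_j^i (1 - p_j^i)."""
    col_sum = {}  # col_sum[j] = sum_i p_j^i (exact stand-in for an F_2 sketch)
    second = 0.0  # running value of the second term
    for item in stream:
        for j, p in item:
            col_sum[j] = col_sum.get(j, 0.0) + p
            second += p * (1.0 - p)
    first = sum(s * s for s in col_sum.values())
    return first + second

def repeat_rate_brute_force(stream):
    """E[sum_j |{i : X_i = j}|^2] by enumerating all deterministic streams."""
    per_item = [list(item) + [(None, 1.0 - sum(p for _, p in item))]
                for item in stream]
    total = 0.0
    for combo in product(*per_item):
        prob = 1.0
        freq = {}
        for v, p in combo:
            prob *= p
            if v is not None:  # None plays the role of ⊥
                freq[v] = freq.get(v, 0) + 1
        total += prob * sum(c * c for c in freq.values())
    return total

stream = [[(1, 0.5), (2, 0.3)], [(1, 0.4)], [(2, 0.9)]]
fast = repeat_rate(stream)
slow = repeat_rate_brute_force(stream)
```

The two values agree up to floating-point error, which reflects the identity $E[N^2] = \mathrm{Var}[N] + E[N]^2$ applied to each count of independent indicator variables.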

Estimating MEDIAN: We present an algorithm for finding an $\epsilon$-approximate MEDIAN. Again
our result is based on a deterministic reduction of the problem to median finding in a deterministic
stream. To solve this problem we use the algorithm of Greenwald and Khanna [13].

Theorem 5. There exists a single pass algorithm finding an $\epsilon$-approximate MEDIAN in $O(\epsilon^{-1} \log m)$
space.

Proof. The idea is to use the selection algorithm of Greenwald and Khanna [13] on an induced stream as
follows.

1. For each tuple $(j, p^i_j)$ in $a_i$, put $\lceil 2m p^i_j \epsilon^{-1} \rceil$ copies of j in an induced stream A'.

2. Using the algorithm of Greenwald and Khanna [13], find an element l such that

$$|\{i \in A' : 1 \le i < l\}| \le (1/2 + \epsilon/2)|A'| \quad\text{and}\quad |\{i \in A' : l < i \le n\}| \le (1/2 + \epsilon/2)|A'| \qquad (1)$$

where $|A'|$ denotes the length of the stream A'.

Note that $p^i_j \le \dfrac{\lceil 2m p^i_j \epsilon^{-1} \rceil}{2m\epsilon^{-1}} \le p^i_j + \epsilon/(2m)$. Therefore, dividing Eq. (1) by $2m\epsilon^{-1}$ yields

$$\Big( \sum_{1 \le j < l,\, i \in [m]} p^i_j \Big) - \epsilon/2 \le (1/2 + \epsilon/2) \lceil \mathrm{COUNT} \rceil \quad\text{and}\quad \Big( \sum_{l < j \le n,\, i \in [m]} p^i_j \Big) - \epsilon/2 \le (1/2 + \epsilon/2) \lceil \mathrm{COUNT} \rceil .$$

The result follows since COUNT > 0. □
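The reduction above can be sketched in Python as follows. This example makes several assumptions: the stream length m is known in advance, the induced stream A' is represented by a table of copy counts instead of being materialized, and an exact weighted median over that table stands in for the Greenwald and Khanna summary [13], so the sketch illustrates the reduction and not the space bound. The function name `approx_median` is hypothetical.

```python
import math

def approx_median(stream, m, eps):
    """Median of the induced stream A' with ceil(2 m p / eps) copies per tuple."""
    copies = {}  # copies[j] = number of occurrences of j in A'
    for item in stream:
        for j, p in item:
            copies[j] = copies.get(j, 0) + math.ceil(2 * m * p / eps)
    total = sum(copies.values())
    running = 0
    for j in sorted(copies):
        running += copies[j]
        if 2 * running >= total:  # first value reaching half of |A'|
            return j
    return None  # the stream carried no probability mass

# Hypothetical stream with m = 4 items over the universe [5].
stream = [[(1, 0.5), (3, 0.2)], [(2, 0.9)], [(5, 0.3), (2, 0.3)], [(4, 1.0)]]
eps = 0.1
median = approx_median(stream, m=4, eps=eps)

# Check the definition of an eps-approximate median on this stream.
count = sum(p for item in stream for _, p in item)
below = sum(p for item in stream for j, p in item if j < median)
above = sum(p for item in stream for j, p in item if j > median)
bound = (0.5 + eps) * math.ceil(count)
```

For this stream the returned value is 2, with expected mass 0.5 below it and 1.5 above it, both within the bound of 0.6 · 4 = 2.4.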

5 Concluding Remarks 

A number of remarkable algorithmic ideas have been developed for estimating aggregates over 
deterministic streams since the seminal work of [1]. Some of them are applicable to estimating
aggregates over probabilistic streams such as in estimating MEDIAN and REPEAT-RATE by suitable 
reductions, but for other aggregates such as DISTINCT and AVG, we need new ideas that we 
have presented here. The probabilistic stream model was introduced in [16] mainly motivated 
by probabilistic databases where data items have a distribution associated with them because of 
the uncertainties and inconsistencies in the data sources. This model has other applications too, 
including in the motivating scenario we described here in which the stream (topic distribution of 
search queries) derived from the deterministic input stream (search terms) is probabilistic. We 
believe that the probabilistic stream model will be very useful in practice in dealing with such 
applications. 

There are several technical and conceptual open problems. For example, could one characterize 
problems for which there is a (deterministic or randomized) reduction from probabilistic streams to 
deterministic streams without significant loss in space bounds or approximations? We suspect that 
for additive approximations, there is a simple characterization. Also, can we extend the solutions 
for estimating the basic aggregates we have presented here to others, in particular, geometric 
aggregates [15] or aggregate properties of graphs [10, 11]?